Frame Aggregation and Multi-modal Fusion Framework for Video-Based Person Recognition
Authors
Abstract
Video-based person recognition is challenging because persons may be blocked or blurred and because the shooting angle varies. Previous research has mostly focused on person recognition in still images, ignoring the similarity and continuity between video frames. To tackle these challenges, we propose a novel Frame Aggregation and Multi-Modal Fusion (FAMF) framework for video-based person recognition, which aggregates face features and incorporates them with multi-modal information to identify persons in videos. For frame aggregation, we propose a novel trainable layer based on NetVLAD (named AttentionVLAD), which takes an arbitrary number of features as input and computes a fixed-length aggregated feature based on feature quality. We show that introducing an attention mechanism into NetVLAD effectively decreases the impact of low-quality frames. For the multi-modal information of videos, we propose a Multi-Layer Multi-Modal Attention (MLMA) module to learn the correlation of multi-modality by adaptively updating the Gram matrix. Experimental results on the iQIYI-VID-2019 dataset show that our framework outperforms other state-of-the-art methods.
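The frame-aggregation idea in the abstract (an arbitrary number of per-frame features in, one fixed-length feature out, with low-quality frames down-weighted) can be illustrated with a minimal sketch. This is not the authors' AttentionVLAD layer: the abstract does not say where the attention enters, so the per-frame linear quality scorer (`quality_w`), the distance-based soft assignment, and the function name `aggregate_frames` are all assumptions made for illustration, and nothing here is trained.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def l2_normalize(vec, eps=1e-12):
    """Scale a vector to (approximately) unit L2 norm."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / (norm + eps) for v in vec]


def aggregate_frames(features, centers, quality_w, alpha=1.0):
    """Attention-weighted VLAD-style aggregation (illustrative sketch).

    features:  N frame features, each of dimension D (N may vary per video)
    centers:   K cluster centers of dimension D
    quality_w: D weights of a hypothetical linear frame-quality scorer
    Returns a vector of fixed length K * D, independent of N.
    """
    if not features:
        raise ValueError("at least one frame feature is required")
    k, d = len(centers), len(centers[0])

    # Assumed attention: softmax over frames of a linear quality score.
    scores = [sum(w * x for w, x in zip(quality_w, f)) for f in features]
    attention = softmax(scores)

    vlad = [[0.0] * d for _ in range(k)]
    for f, a in zip(features, attention):
        # Soft assignment of the frame to clusters (closer center, larger weight).
        sq_dists = [sum((x - c) ** 2 for x, c in zip(f, center)) for center in centers]
        assign = softmax([-alpha * dist for dist in sq_dists])
        # Accumulate attention-weighted residuals to each cluster center.
        for ki, center in enumerate(centers):
            for j in range(d):
                vlad[ki][j] += a * assign[ki] * (f[j] - center[j])

    # Normalize each cluster's residual, flatten, then normalize the whole vector.
    flat = [v for row in vlad for v in l2_normalize(row)]
    return l2_normalize(flat)
```

Calling `aggregate_frames` on videos with two frames and with five frames returns vectors of the same length (K * D), and a frame whose quality score is far below the others contributes almost nothing to the result.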
Similar resources
tight frame approximation for multi-frames and super-frames
In this thesis, we study a generator for a multi-frame or super-frame generated under the action of a projective unitary representation of a countable discrete group. Examples of such frames include Gabor multi-frames, Gabor super-frames, and frames for shift-invariant subspaces. We show that there exists a unique normalized tight multi-frame (super-frame) generator that lies at minimum distance from the given generator. Similar problems are also posed for dual frames, and some ...
Multi-modal Person Recognition for Vehicular Applications
In this paper, we present biometric person recognition experiments in a real-world car environment using speech, face, and driving signals. We have performed experiments on a subset of the in-car CIAIR corpus collected at the Nagoya University, Japan. We have used Mel-frequency cepstral coefficients (MFCC) for speaker recognition. For face recognition, we have reduced the feature dimension of e...
Multi-modal Aggregation for Video Classification
In this paper, we present a solution to Large-Scale Video Classification Challenge (LSVC2017) [1] that ranked the 1st place. We focused on a variety of modalities that cover visual, motion and audio. Also, we visualized the aggregation process to better understand how each modality takes effect. Among the extracted modalities, we found Temporal-Spatial features calculated by 3D convolution quit...
Multi-modal analysis for person type classification in news video
Classifying the identities of people appearing in broadcast news video into anchor, reporter, or news subject is an important topic in high-level video analysis. Given the visual resemblance of different types of people, this work explores multi-modal features derived from a variety of evidences, such as the speech identity, transcript clues, temporal video structure, named entities, and uses a...
Journal
Journal title: Lecture Notes in Computer Science
Year: 2021
ISSN: 1611-3349, 0302-9743
DOI: https://doi.org/10.1007/978-3-030-67832-6_7